10 research outputs found

    Reviewer Integration and Performance Measurement for Malware Detection

    Full text link
    We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system's ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal submissions spanning 2.5 years and containing 1.1 million binaries with 778GB of raw feature data. Without reviewer assistance, we achieve 72% detection at a 0.5% false positive rate, performing comparably to the best vendors on VirusTotal. Given a budget of 80 accurate reviews daily, we improve detection to 89% and are able to detect 42% of malicious binaries undetected upon initial submission to VirusTotal. Additionally, we identify a previously unnoticed temporal inconsistency in the labeling of training datasets. We compare the impact of training labels obtained at the same time the training data is first seen with training labels obtained months later. We find that using training labels obtained well after samples appear, and thus unavailable in practice for current training data, inflates measured detection by almost 20 percentage points. We release our cluster-based implementation, as well as a list of all hashes in our evaluation and 3% of our entire dataset.
    Comment: 20 pages, 11 figures, accepted at the 13th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA 2016)
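
    A minimal sketch, not the authors' released implementation, of the temporal labeling comparison described above: train once on the labels available when the training data first appeared and once on labels collected months later, then measure detection against the mature labels at a fixed false positive budget. The logistic-regression stand-in, the array layout, and the variable names are illustrative assumptions.

```python
# Hypothetical sketch of the temporal label experiment (not the released code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def measured_detection(X_train, y_train, X_test, y_test_mature, fpr_budget=0.005):
    """Detection rate on mature test labels at roughly a 0.5% false positive rate."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    # Spend the false positive budget on the benign test samples to pick a threshold.
    threshold = np.quantile(scores[y_test_mature == 0], 1.0 - fpr_budget)
    return float((scores[y_test_mature == 1] > threshold).mean())

# The inflation the paper quantifies (almost 20 percentage points) would be:
#   measured_detection(X_tr, y_tr_months_later, X_te, y_te_mature)
# - measured_detection(X_tr, y_tr_at_first_sight, X_te, y_te_mature)
```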

    Complaint-driven Training Data Debugging for Query 2.0

    Full text link
    As the need for machine learning (ML) increases rapidly across all industry sectors, there is significant interest among commercial database providers in supporting "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging since an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query's intermediate or final output, and aims to return a minimum set of training examples such that removing them would resolve the complaints. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions, both of which require only a linear number of retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness on four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.
    Comment: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
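
    A minimal sketch, assuming a regularized logistic-regression model, of the influence-function heuristic that Rain builds on: score every training example by how much its removal would move the complained-about prediction, instead of retraining an exponential number of models. The single-complaint setup, the closed-form Hessian, and all names below are illustrative assumptions, not Rain's API.

```python
# Hypothetical influence-function ranking for a single complaint (not Rain's API).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def influence_on_complaint(X, y, w, x_c, y_c, reg=1e-2):
    """X: (n, d) training features, y in {-1, +1}, w: fitted weights of a
    regularized logistic regression, (x_c, y_c): the complained-about point.
    Returns one score per training example; the largest scores mark the
    examples whose removal most helps resolve the complaint (up to the usual
    sign/scale conventions of influence functions)."""
    # Per-example gradients of the training loss at w.
    grads = -(y * sigmoid(-y * (X @ w)))[:, None] * X            # (n, d)
    # Hessian of the regularized training loss at w.
    p = sigmoid(X @ w)
    H = (X * (p * (1 - p))[:, None]).T @ X + reg * np.eye(len(w))
    # Gradient of the complaint's loss at w.
    g_c = -(y_c * sigmoid(-y_c * (x_c @ w))) * x_c
    # Influence of removing example i on the complaint: grad_i^T H^{-1} grad_c.
    return grads @ np.linalg.solve(H, g_c)
```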

    Taming Evasions in Machine Learning Based Detection Pipelines

    No full text
    This thesis presents and evaluates three mitigation techniques for evasion attacks against machine learning based detection pipelines. Machine learning based detection pipelines provide much of the security in modern computerized systems. For instance, these pipelines are responsible for detecting undesirable content on computing platforms and Internet-based services, such as malicious software and email spam. Owing to its adversarial nature, the security application domain exhibits a permanent arms race between attackers, who aim to avoid, or evade, detection, and the pipeline's maintainers, whose aim is to catch all undesirable content.
    The first part of this thesis examines a defense technique for the concrete application domain of comment spam on social media. We propose content complexity, a compression-based normalized measure of textual redundancy that is largely insensitive to the underlying language and to adversarial spelling variations. We demonstrate on a real dataset of tens of millions of comments that content complexity alone achieves 15 percentage points higher precision than a state-of-the-art detection system.
    The second part of this thesis takes a quantitative approach to evasion and introduces one machine learning algorithm and one learning framework for building hardened detection pipelines. Both techniques are generic and suitable for a large class of application domains. We propose the convex polytope machine, a non-linear large-scale learning algorithm which finds a large-margin polytope separator and thereby decreases the effectiveness of evasion attacks. We show that, as a general-purpose machine learning algorithm, the convex polytope machine displays an outstanding trade-off between classification accuracy and computational efficiency. We also demonstrate on a benchmark handwritten digit recognition task that the convex polytope machine is quantitatively as evasion-resistant as a classic neural network.
    We finally introduce adversarial boosting, a boosting-inspired framework for iteratively building ensemble classifiers that are hardened against evasion attacks. Adversarial boosting operates by repeatedly constructing evasion attacks and adding the corresponding corrective sub-classifiers to the ensemble. We implement this technique for decision tree sub-classifiers by constructing the first exact and approximate automatic evasion algorithms for tree ensembles. For our benchmark task, the adversarially boosted tree ensemble is respectively five times and two times less evasion-susceptible than regular tree ensembles and the convex polytope machine.
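
    A minimal sketch, under simplifying assumptions, of the adversarial boosting loop described in the last paragraph: repeatedly attack the current tree ensemble and fold the evading variants back into the training set as corrective examples. The `attack` callback stands in for the thesis's exact and approximate evasion algorithms, which are not reproduced here, and the other names are illustrative.

```python
# Hypothetical adversarial boosting loop for tree sub-classifiers (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adversarial_boosting(X, y, attack, rounds=10):
    """X: (n, d) features, y in {0, 1}; attack(predict_fn, X_malicious) must
    return evading variants of the malicious points against the current
    ensemble. It stands in for an exact or approximate evasion algorithm."""
    ensemble = []

    def predict(X_):
        # Majority vote over the sub-classifiers built so far.
        votes = np.mean([t.predict(X_) for t in ensemble], axis=0)
        return (votes >= 0.5).astype(int)

    X_aug, y_aug = X.copy(), y.copy()
    for _ in range(rounds):
        # Fit the next corrective sub-classifier on the data seen so far.
        ensemble.append(DecisionTreeClassifier(max_depth=8).fit(X_aug, y_aug))
        # Construct evasions of the malicious points against the ensemble and
        # add them back as additional malicious training examples.
        evasions = attack(predict, X[y == 1])
        X_aug = np.vstack([X_aug, evasions])
        y_aug = np.concatenate([y_aug, np.ones(len(evasions), dtype=y.dtype)])
    return ensemble, predict
```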

    Large-margin convex polytope machine

    No full text
    We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large-margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic classification, and web security) demonstrate that the CPM trains models faster, sometimes by several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.
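
    A minimal sketch, assuming a plain hinge-loss SGD formulation, of the CPM decision rule described above: the score of a point is the maximum over K linear sub-classifiers (the polytope faces), and each stochastic update only touches the face currently responsible for the point. The always-argmax assignment below is a simplification of the paper's heuristic for escaping sub-optimal local minima.

```python
# Hypothetical CPM-style classifier (illustrative simplification of the paper's algorithm).
import numpy as np

class ConvexPolytopeMachine:
    def __init__(self, n_faces=10, lr=0.01, reg=1e-4, epochs=5, seed=0):
        self.K, self.lr, self.reg, self.epochs = n_faces, lr, reg, epochs
        self.rng = np.random.default_rng(seed)
        self.W = None                                  # (K, d) face weights

    def decision(self, X):
        # Maximum response over the polytope faces.
        return (X @ self.W.T).max(axis=1)

    def fit(self, X, y):                               # y in {-1, +1}
        n, d = X.shape
        self.W = 0.01 * self.rng.standard_normal((self.K, d))
        for _ in range(self.epochs):
            for i in self.rng.permutation(n):
                scores = self.W @ X[i]
                k = int(np.argmax(scores))             # face currently responsible
                if y[i] * scores[k] < 1.0:             # hinge margin violated
                    self.W[k] += self.lr * (y[i] * X[i] - self.reg * self.W[k])
        return self

    def predict(self, X):
        # Positives lie outside the polytope, negatives inside it.
        return np.where(self.decision(X) > 0, 1, -1)
```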

    Robust detection of comment spam using entropy rate

    No full text
    In this work, we design a method for blog comment spam detection using the assumption that spam is any kind of uninformative content. To measure the “informativeness” of a set of blog comments, we construct a language- and tokenization-independent metric which we call content complexity, providing a normalized answer to the informal question “how much information does this text contain?” We leverage this metric to create a small set of features well-adjusted to comment spam detection by computing the content complexity over groupings of messages sharing the same author, the same sender IP, the same included links, etc. We evaluate our method against an exact set of tens of millions of comments collected over a four-month period and covering a variety of websites, including blogs and news sites. The data was provided to us with an initial spam labeling from an industry-competitive source; nevertheless, this initial labeling had unknown performance characteristics. To train a logistic regression on this dataset using our features, we derive a simple mislabeling-tolerant logistic regression algorithm based on expectation-maximization, which we show generally outperforms the plain version in precision-recall space. By using a parsimonious hand-labeling strategy, we show that our method can operate at an arbitrarily high precision level, and that it significantly dominates the original labeling in terms of both precision and recall, despite being trained on it alone. The content complexity metric, the use of a noise-tolerant logistic regression, and the evaluation methodology are thus the three central contributions of this work.
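
    A minimal sketch, using zlib as the compressor, of the content-complexity features described above: concatenate the comments that share a grouping key (author, sender IP, included link, ...) and measure how well the concatenation compresses, since templated spam campaigns compress far better than organic discussion. The plain compressed-size-over-original-size ratio below is a simplification of the paper's length-normalized metric, and the field names are illustrative.

```python
# Hypothetical content-complexity features per grouping key (illustrative only).
import zlib
from collections import defaultdict

def content_complexity(text: str) -> float:
    """Compressed size over original size; lower means more redundant text."""
    data = text.encode("utf-8")
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)

def complexity_by_group(comments, key):
    """comments: iterable of dicts with a 'text' field plus grouping fields;
    key: e.g. 'author' or 'ip'. Returns one complexity score per group."""
    groups = defaultdict(list)
    for c in comments:
        groups[c[key]].append(c["text"])
    return {k: content_complexity("\n".join(texts)) for k, texts in groups.items()}

# Groups with low scores (highly compressible concatenations) are candidate spam campaigns.
```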
